Add Workflow Registry Metrics and remove executions running counter #16463

Open
wants to merge 14 commits into base: develop
Conversation

Contributor

@patrickhuie19 patrickhuie19 commented Feb 18, 2025

The executions running gauge can be misleading since it is updated on heartbeat, which is async to workflow deletes through the workflow registry.

I noticed we're lacking some metrics in the workflow registry, so I added coverage.
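
To make the added registry metrics concrete, here is a minimal, self-contained sketch using the plain OpenTelemetry Go API; the actual PR wires these through the repo's beholder/monitoring helpers, and the package, type, metric, and method names below are illustrative assumptions, not the PR's exact code. It also approximates the registry-size gauge with an up/down counter bumped at register/unregister time, which is the point of the change: the value tracks registry events directly instead of lagging behind on an async heartbeat.

package workflowmetrics // illustrative package name

import (
	"context"

	"go.opentelemetry.io/otel"
	"go.opentelemetry.io/otel/metric"
)

// registryMetrics is a hypothetical holder for the workflow registry instruments discussed in this PR.
type registryMetrics struct {
	registerCounter metric.Int64Counter       // counts register events handled by the workflow registry
	totalWorkflows  metric.Int64UpDownCounter // tracks engines currently held in the engine registry
}

func newRegistryMetrics() (*registryMetrics, error) {
	meter := otel.Meter("workflow_registry")
	reg, err := meter.Int64Counter("workflow_registry_register_events",
		metric.WithDescription("Number of workflow register events handled"))
	if err != nil {
		return nil, err
	}
	total, err := meter.Int64UpDownCounter("workflow_registry_total_workflows",
		metric.WithDescription("Workflow engines currently tracked by the engine registry"))
	if err != nil {
		return nil, err
	}
	return &registryMetrics{registerCounter: reg, totalWorkflows: total}, nil
}

// onRegister is called synchronously from the register path, not from an async heartbeat.
func (m *registryMetrics) onRegister(ctx context.Context) {
	m.registerCounter.Add(ctx, 1)
	m.totalWorkflows.Add(ctx, 1)
}

// onDelete is called synchronously from the delete path, so the total drops as soon as the engine is removed.
func (m *registryMetrics) onDelete(ctx context.Context) {
	m.totalWorkflows.Add(ctx, -1)
}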

@patrickhuie19 patrickhuie19 requested a review from a team as a code owner February 18, 2025 21:49
@patrickhuie19 patrickhuie19 changed the title from "changed engineCleanup from best effort based on service health to def…" to "Engine Syncer Change + Metrics cleanup" on Feb 18, 2025
Contributor

I see you updated files related to core. Please run pnpm changeset in the root directory to add a changeset, and include at least one of the following tags in the changeset text:

  • #added For any new functionality added.
  • #breaking_change For any functionality that requires manual action for the node to boot.
  • #bugfix For bug fixes.
  • #changed For any change to the existing functionality.
  • #db_update For any feature that introduces updates to database schema.
  • #deprecation_notice For any upcoming deprecation functionality.
  • #internal For changesets that need to be excluded from the final changelog.
  • #nops For any feature that is NOP facing and needs to be in the official Release Notes for the release.
  • #removed For any functionality/config that is removed.
  • #updated For any functionality that is updated.
  • #wip For any change that is not ready yet and external communication about it should be held off till it is feature complete.

🎖️ No JIRA issue number found in: PR title, commit message, or branch name. Please include the issue ID in one of these.

@@ -1153,7 +1153,6 @@ func (e *Engine) heartbeat(ctx context.Context) {
 			return
 		case <-ticker.C:
 			e.metrics.incrementEngineHeartbeatCounter(ctx)
-			e.metrics.updateTotalWorkflowsGauge(ctx, e.stepUpdatesChMap.len())
Contributor Author

I am removing this metric entirely. Read PR description for why

Contributor

Hmmm... this still feels like it could be useful @patrickhuie19 -- I can imagine scenarios where the number of executions is just increasing, and not going down such that we'll eventually exhaust the queue. In that case this metric would help us diagnose what's going on.

You mentioned that this relates to deletes of the workflow engine -- can you say more about how? I'm not following entirely

Contributor

Note that the name of this metric (total workflows running) doesn't correspond to what it's measuring (the number of executions running in an instance of the workflow engine); maybe that's the source of confusion here?

Contributor Author

It seems to be more confusing than helpful. The name is part of it. The metric is updated on a heartbeat, which is async to engine.Start() and engine.Close().

I can imagine scenarios where the number of executions is just increasing, and not going down such that we'll eventually exhaust the queue

Thanks for the example use case. Would we be able to satisfy this by looking at the completed duration histogram metrics and observing no executions are finishing?

@@ -1266,7 +1265,7 @@ const (
defaultQueueSize = 100000
defaultNewWorkerTimeout = 2 * time.Second
defaultMaxExecutionDuration = 10 * time.Minute
defaultHeartbeatCadence = 5 * time.Minute
Contributor Author

5 minutes is anemic

// tryEngineCleanup attempts to stop the workflow engine for the given workflow ID. Does nothing if the
// workflow engine is not running.
func (h *eventHandler) tryEngineCleanup(wfID string) error {
	if h.engineRegistry.IsRunning(wfID) {
Contributor Author

@patrickhuie19 patrickhuie19 Feb 18, 2025

This had logic based on whether or not the underlying Ready implementation in the embedded services.StateMachine would return an error. The Engine has no subservices, Close() isn't marked as synchronous with Ready in services.StateMachine, and if the Engine wasn't Ready, we would do nothing anyway.

If we want to stay flexible to checking the status of Ready in the future, we could add some type of sync loop to check status and delete when Ready?

Contributor

@patrickhuie19 Are we subtly changing behaviour here? We'll return an error now if we can't retrieve the engine from the registry, whereas previously we would have ignored that.

Contributor Author

Yes. I'll change it so that engineCleanup errors under the same conditions as previously, to keep the scope of this PR smaller. The tradeoff with handling a missing engine instance this gracefully is that it might fly under our radar when engine instances are getting cleared from the registry before they should be.
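
For reference, a rough sketch of how the cleanup path reads with the Ready gate removed and the previous "ignore a missing engine" behaviour kept, as agreed above. Get and Remove are assumed method names here, since the engine registry's full API isn't shown in this diff:

// tryEngineCleanup attempts to stop the workflow engine for the given workflow ID. Does nothing if the
// workflow engine is not running.
func (h *eventHandler) tryEngineCleanup(wfID string) error {
	if h.engineRegistry.IsRunning(wfID) {
		e, ok := h.engineRegistry.Get(wfID) // assumed accessor
		if !ok {
			// Keep the earlier behaviour discussed above: a missing engine is ignored rather than surfaced
			// as an error. The tradeoff is that premature registry evictions can fly under the radar.
			return nil
		}
		if err := e.Close(); err != nil {
			return fmt.Errorf("failed to close workflow engine: %w", err)
		}
		h.engineRegistry.Remove(wfID) // assumed; drops the stopped engine from the registry
	}
	return nil
}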

@patrickhuie19 patrickhuie19 requested a review from Atrax1 February 18, 2025 21:58
Contributor

github-actions bot commented Feb 18, 2025

AER Report: CI Core

aer_workflow , commit , Clean Go Tidy & Generate , Detect Changes , Scheduled Run Frequency , Core Tests (go_core_tests) , Core Tests (go_core_tests_integration) , Core Tests (go_core_ccip_deployment_tests) , GolangCI Lint (.) , Core Tests (go_core_fuzz) , Core Tests (go_core_race_tests) , test-scripts , lint , SonarQube Scan

1. GolangCI Lint job failed due to non-wrapping format verb in fmt.Errorf

Source of Error:
Golang Lint (.)	2025-02-27T21:20:27.9803622Z ##[error]core/services/workflows/delegate.go:50:167: non-wrapping format verb for fmt.Errorf. Use `%w` to format errors (errorlint)
Golang Lint (.)	2025-02-27T21:20:27.9805430Z 		return fmt.Errorf("delegate failed to unregister workflow engine for workflow name: %s id: %s: %v", spec.WorkflowSpec.WorkflowName, spec.WorkflowSpec.WorkflowName, err)
Golang Lint (.)	2025-02-27T21:20:27.9810742Z 		 ^
Golang Lint (.)	2025-02-27T21:20:27.9812508Z ##[error]core/services/workflows/delegate.go:101:149: non-wrapping format verb for fmt.Errorf. Use `%w` to format errors (errorlint)
Golang Lint (.)	2025-02-27T21:20:27.9814103Z 		return nil, fmt.Errorf("delegate failed to register workflow engine for workflow name: %s id: %s: %v", cfg.WorkflowName.String(), cfg.WorkflowID, err)
Golang Lint (.)	2025-02-27T21:20:27.9818405Z 		 ^

Why: The error is caused by the use of %v instead of %w in fmt.Errorf for wrapping errors. The %w verb is required to wrap errors in Go.

Suggested fix: Replace %v with %w in the fmt.Errorf calls to properly wrap the errors. For example, change fmt.Errorf("... %v", err) to fmt.Errorf("... %w", err).
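
For illustration, the before/after of the pattern the linter is flagging, with placeholder variables rather than the repo's actual ones:

// %v flattens the wrapped error into a string, so callers can no longer match it with errors.Is / errors.As.
return fmt.Errorf("delegate failed to unregister workflow engine for workflow name: %s id: %s: %v", name, id, err)

// %w keeps err in the error chain, which is what the errorlint rule wants.
return fmt.Errorf("delegate failed to unregister workflow engine for workflow name: %s id: %s: %w", name, id, err)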

AER Report: Operator UI CI ran successfully ✅

aer_workflow , commit

@patrickhuie19 patrickhuie19 force-pushed the fix/execution-num-metric branch from 86e8af2 to b7c5a58 on February 22, 2025 00:21
@patrickhuie19 patrickhuie19 changed the title from "Engine Syncer Change + Metrics cleanup" to "Add Workflow Registry Metrics to help chase down possible deletion bug" on Feb 22, 2025
Contributor

@bolekk bolekk left a comment

Maybe I'm too tired but I'm really struggling to make sense of the description of this PR.

What I see going on:

  1. You are removing "workflowsRunningGauge" because it is pretty useless. It's counting the current number of executions at a given point in time, updated only with the heartbeat, correct?
  2. You are adding a bunch of new counters to track all events in the Workflow Registry. Those make sense to me, especially "totalWorkflowsGauge". It won't cover workflows added outside the Registry but I guess that's fine.

What I don't understand is how this is going to help debug why metrics are lingering after workflow deletion. You say that other metrics hang around too, not only the executions gauge. Are you able to reproduce it locally? What new signal are you expecting to get once we start running this in prod?

@patrickhuie19 patrickhuie19 changed the title from "Add Workflow Registry Metrics to help chase down possible deletion bug" to "Add Workflow Registry Metrics and remove executions running counter" on Feb 22, 2025
@patrickhuie19
Contributor Author

Thanks, simplified the description. Your points 1 and 2 are correct. There were multiple drivers behind the changes in this PR, and they are worthwhile on their own, outside the context of the time-series-persisting-after-deletion bug we're tracking down.

bolekk previously approved these changes Feb 22, 2025
Contributor

@bolekk bolekk left a comment

Thanks!

metrics.incrementRegisterCounter(ctx)

// intentionally without workflow specific labels
h.metrics.updateTotalWorkflowsGauge(ctx, int64(h.engineRegistry.Size()))
Contributor

Yes, but slightly misleading: this really only measures workflows that have been started by the syncer -- we should add an increment in the workflows/delegate to give an accurate picture.
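
A rough illustration of that suggestion, with hypothetical method and field names since the delegate's internals aren't shown in this diff: wherever the delegate (rather than the syncer) registers or unregisters an engine, it would update the same gauge.

// Hypothetical hook in core/services/workflows/delegate.go; names are assumptions for illustration only.
func (d *Delegate) registerEngine(ctx context.Context, wfID string, e *Engine) error {
	if err := d.engineRegistry.Add(wfID, e); err != nil {
		return fmt.Errorf("delegate failed to register workflow engine for workflow id %s: %w", wfID, err)
	}
	// Keep the registry-wide gauge accurate for engines started outside the syncer path too.
	d.metrics.updateTotalWorkflowsGauge(ctx, int64(d.engineRegistry.Size()))
	return nil
}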

m, err := initMonitoringResources()
if err != nil {
	lggr.Fatalw("Failed to initialize monitoring resources", "err", err)
Contributor

hmm. this will kill the node. that seems unreasonable, we don't know what else is running.

where did this init call use to live?

Contributor

if it really should block the whole node, then panic

if not, then critical

Contributor Author

For this particular error to get invoked would signal a potentially catastrophic regression in the Beholder impl. But agreed, we don't know what else is running.
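
A minimal sketch of the non-fatal option raised above, assuming the logger in scope exposes a Criticalw method alongside Fatalw; whether this should instead panic is exactly the judgment call being discussed:

m, err := initMonitoringResources()
if err != nil {
	// Log at critical severity instead of fatal so the rest of the node keeps running.
	lggr.Criticalw("Failed to initialize monitoring resources", "err", err)
	// Propagate the failure (exact return shape depends on the enclosing constructor) and let the
	// caller decide whether monitoring is a hard requirement.
	return fmt.Errorf("failed to initialize monitoring resources: %w", err)
}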
